🚀 LLM Inference Servers

A Complete Journey from Basics to Advanced Concepts

🎯 The Fundamentals: How LLM Inference Actually Works

đŸŊī¸ The Restaurant Analogy

Think of an LLM like a master chef in a restaurant. When you give them a recipe (prompt), they don't cook the entire meal at once. Instead, they prepare it one ingredient at a time, tasting and adjusting as they go. Each "ingredient" is like a token (word or part of a word) that the model generates one by one.

Step 1: Understanding Tokens

What are Tokens?

"Hello my name is Alice" → [Hello] [my] [name] [is] [Alice]

Tokens are the basic units that LLMs understand - they can be words, parts of words, or even punctuation!
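
If you want to see this for yourself, here is a minimal sketch using the Hugging Face transformers library with the GPT-2 tokenizer (an assumption for illustration; the exact splits vary from model to model):

# Minimal tokenization demo (assumes the `transformers` package is installed)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello my name is Alice"
print(tokenizer.tokenize(text))   # e.g. ['Hello', 'Ġmy', 'Ġname', 'Ġis', 'ĠAlice']
print(tokenizer.encode(text))     # the integer IDs the model actually sees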

Step 2: The Two Phases of Inference

🔄 PREFILL Phase

Process the entire prompt in parallel

Fast but compute-intensive

📝 DECODE Phase

Generate tokens one by one

Slower and memory-bandwidth-bound, but predictable

đŸƒâ€â™‚ī¸ The Reading vs Writing Analogy

Prefill is like speed-reading an entire book (your prompt) very quickly to understand the context.

Decode is like writing a response letter word by word, thinking carefully about each word before writing the next one.

Step 3: What Makes This Challenging?

🚨 The Big Problem: Each new token depends on ALL previous tokens!

This creates a memory bottleneck and limits how fast we can generate text.

# Simple example of how inference works (pseudocode; `model` is a stand-in)
prompt = "The weather today is"
max_tokens = 20

# Prefill: process the entire prompt at once
context = model.process_prompt(prompt)

# Decode: generate one token at a time
tokens = []
for i in range(max_tokens):
    next_token = model.predict_next(context + tokens)
    tokens.append(next_token)
    if next_token == "<END>":
        break

# Result: "The weather today is sunny and warm"

🧠 KV Cache: The Memory That Makes Everything Fast

🎓 The Study Group Analogy

Imagine you're in a study group discussing a complex topic. Instead of re-reading the entire textbook every time someone asks a question, you keep detailed notes (KV Cache) of everything discussed so far. When a new question comes up, you can quickly reference your notes instead of starting from scratch!

What is KV Cache?

🔑 Key-Value Cache Explained

Keys (K): Help the model find relevant information

Values (V): Store the actual information content

K1/V1 | K2/V2 | K3/V3 | K4/V4 | K5/V5

Each token gets its own Key-Value pair stored in memory

Why is KV Cache Necessary?

❌ Without KV Cache (Inefficient)

Token 1: Process [The]

Token 2: Process [The, cat] ← Recalculate everything!

Token 3: Process [The, cat, sat] ← Recalculate everything again!


✅ With KV Cache (Efficient)

Token 1: Process [The] → Store K1,V1

Token 2: Use K1,V1 + Process [cat] → Store K2,V2

Token 3: Use K1,V1,K2,V2 + Process [sat] → Store K3,V3
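
Here is a toy NumPy sketch of the "with KV cache" pattern above. The projection matrices and the attend function are stand-ins for a real transformer layer, so treat it as an illustration of the bookkeeping, not real model code:

# Toy KV-cache bookkeeping with NumPy (illustrative only)
import numpy as np

d = 8                              # toy hidden size
W_q = np.random.randn(d, d)        # toy Query projection
W_k = np.random.randn(d, d)        # toy Key projection
W_v = np.random.randn(d, d)        # toy Value projection

def attend(q, keys, values):
    # standard scaled dot-product attention over everything cached so far
    scores = np.array([q @ k for k in keys]) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return sum(w * v for w, v in zip(weights, values))

cache_k, cache_v = [], []
for token_embedding in np.random.randn(3, d):   # stand-ins for [The, cat, sat]
    # each token's K and V are computed exactly once and cached...
    cache_k.append(token_embedding @ W_k)
    cache_v.append(token_embedding @ W_v)
    # ...so the new token only attends over the cache; nothing is recomputed
    out = attend(token_embedding @ W_q, cache_k, cache_v)  # feeds the next layer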

The Memory Challenge

🔥 For a 70B model like Llama 3.3:

• Each token needs ~800KB of KV cache

• A 2048-token conversation = 1.6GB just for cache!

• This grows linearly with context length
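
The arithmetic behind such estimates is simple. The sketch below shows the formula with illustrative, assumed Llama-style settings; the exact per-token figure depends heavily on the layer count, the number of KV heads (grouped-query attention shrinks it well below a full multi-head layout), the head dimension, and the cache precision, which is why published numbers for the "same" model size vary:

# Back-of-envelope KV cache sizing (sketch; plug in your own model's config)
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # 2x because both a Key and a Value vector are stored per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Illustrative numbers only: assumed Llama-style 70B with grouped-query
# attention and an FP16 cache
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(per_token / 1024, "KB per token")
print(per_token * 2048 / 1024**3, "GB for a 2048-token context")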

🏠 The Library Analogy

Think of KV cache like a growing library. Each new book (token) you add needs shelf space (memory). As your library grows, you need more and more shelves. Eventually, you run out of space and need clever storage solutions!

KV Cache Optimizations

đŸ—œī¸ Quantization

Compress the cache data from 16-bit to 8-bit or even 4-bit numbers. Less accurate but much smaller!

📄 Paging

Break cache into small "pages" like computer memory. Avoid wasting space on unused memory!

💾 Offloading

Move old cache data to CPU memory or disk when GPU memory gets full.
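
As a toy illustration of the quantization idea (not the exact scheme any particular server uses), here is an 8-bit round-trip with a single per-tensor scale:

# Toy KV-cache quantization: store int8 values plus a scale instead of FP16
import numpy as np

def quantize_int8(x):
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0   # per-tensor scale
    return np.round(x / scale).astype(np.int8), scale

def dequantize_int8(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

kv_slab = np.random.randn(1024, 128).astype(np.float16)   # pretend cache block
q, scale = quantize_int8(kv_slab.astype(np.float32))
print(kv_slab.nbytes, "->", q.nbytes, "bytes (2x smaller)")
print("mean reconstruction error:",
      np.abs(dequantize_int8(q, scale) - kv_slab).mean())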

⚡ Batching: Serving Multiple Users Efficiently

🚌 The Bus Route Analogy

Imagine you're running a bus service. You could send a separate bus for each passenger (inefficient), or you could group passengers going in the same direction and use one big bus (efficient batching)!

Static vs Continuous Batching

❌ Static Batching (Old Way)

User A: 5 tokens | User B: 50 tokens | User C: 5 tokens
→ Everyone waits for User B to finish!

✅ Continuous Batching (New Way)

User A: ✓ done | User B: token 25/50 | User D: new!
→ Users come and go dynamically!
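
In code, continuous batching boils down to a scheduling loop like the highly simplified sketch below; model.decode_step, the Request fields, and the queue handling are hypothetical stand-ins, not any real server's interface:

# Highly simplified continuous-batching loop (sketch with hypothetical APIs)
from collections import deque

def continuous_batching_loop(model, waiting: deque, max_batch_size=8):
    running = []
    while waiting or running:
        # 1. Admit new requests the moment a slot frees up -- no waiting for
        #    the whole batch to drain like in static batching
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # 2. One decode step for every active request, batched together
        new_tokens = model.decode_step(running)

        # 3. Retire finished requests immediately so their slots can be reused
        still_running = []
        for req, tok in zip(running, new_tokens):
            req.tokens.append(tok)
            if tok != "<END>" and len(req.tokens) < req.max_tokens:
                still_running.append(req)   # not done yet, stays in the batch
        running = still_running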

PagedAttention: The Smart Memory Manager

🧩 PagedAttention Explained

Instead of reserving huge chunks of memory for each user, PagedAttention divides memory into small "pages" and assigns them as needed - just like how your computer's operating system manages memory!

❌ Traditional: reserved 2048 tokens, used 100 tokens → 95% wasted!

✅ PagedAttention: allocated 100 tokens, used 100 tokens → 0% wasted!
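
A toy block allocator captures the core bookkeeping; the block size and method names below are illustrative, not vLLM's actual implementation:

# Toy paged KV-cache allocator in the spirit of PagedAttention (sketch)
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                 # tokens stored per physical block
        self.free_blocks = list(range(num_blocks))   # pool of physical block IDs
        self.block_tables = {}                       # request -> list of block IDs
        self.token_counts = {}                       # request -> tokens written so far

    def append_token(self, request_id):
        """Account for one new token; grab a fresh block only when the last one is full."""
        count = self.token_counts.get(request_id, 0)
        if count % self.block_size == 0:             # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1

    def free(self, request_id):
        """Return a finished request's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)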

Performance Impact

🚀 Throughput Improvements

HuggingFace Transformers: 1x (baseline)
Text Generation Inference: 3.5x
vLLM (with PagedAttention): 24x

🔄 Disaggregated Serving: Separating Prefill from Decode

🏭 The Factory Assembly Line Analogy

Imagine a factory where one team is really good at preparing ingredients (prefill) and another team excels at final assembly (decode). Instead of having each worker do both tasks poorly, you separate them into specialized stations for maximum efficiency!

The Problem with Traditional Serving

😤 Interference Issues

Prefill: Needs lots of compute, short burst

Decode: Needs consistent memory access, long duration

Together: They fight for resources and slow each other down!

Disaggregated Architecture

🔄 Prefill Cluster

Specialized for parallel processing

Optimized for TTFT

📡 KV Transfer

High-speed network

~17ms for 2048 tokens

📝 Decode Cluster

Specialized for sequential generation

Optimized for throughput
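
Conceptually, the request path looks like the sketch below; every name here (prefill_worker, decode_worker, receive_kv, ...) is a hypothetical placeholder rather than any specific framework's API:

# Conceptual disaggregated request path (all names are hypothetical placeholders)
def handle_request(prompt, prefill_worker, decode_worker, max_tokens=256):
    # 1. Prefill cluster: one parallel pass over the whole prompt, optimized for TTFT
    kv_cache, first_token = prefill_worker.prefill(prompt)

    # 2. Hand the KV cache to the decode cluster over a fast interconnect
    handle = decode_worker.receive_kv(kv_cache)

    # 3. Decode cluster: stream the remaining tokens one at a time
    tokens = [first_token]
    while len(tokens) < max_tokens and tokens[-1] != "<END>":
        tokens.append(decode_worker.decode_step(handle))
    return tokens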

Benefits of Disaggregation

⚡ Better Latency

Prefill doesn't interfere with decode operations. Users get consistent response times.

🎯 Specialized Hardware

Use different GPU configurations optimized for each phase's specific needs.

📈 Higher Throughput

Up to 7x higher request rates with the same SLA requirements.

💰 Cost Efficiency

Scale prefill and decode independently based on actual demand patterns.

Real-World Example

đŸĸ OpenAI and Google use disaggregated serving!

For ChatGPT: Prefill cluster handles your prompt, then hands off to decode cluster for streaming response generation.

🔮 Speculative Decoding: Predicting the Future

🎯 The Chess Master Analogy

Imagine a chess master playing against a computer. The master can quickly think of several good moves (draft), then the computer carefully verifies which ones are actually legal and best (verification). This way, multiple moves can be planned in the time it usually takes to plan one!

How Speculative Decoding Works

đŸƒâ€â™‚ī¸ Draft Model

Small, fast model

Generates 3-5 candidate tokens

🔍 Verification

Large target model

Checks all candidates in parallel

✅ Accept/Reject

Keep good predictions

Reject bad ones
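
A greatly simplified greedy version of that loop is sketched below. Real systems use a probabilistic accept/reject rule that preserves the target model's output distribution; this toy version just keeps the longest exact-match prefix, and both model APIs are assumed:

# Simplified draft-and-verify step (sketch with hypothetical model APIs)
def speculative_step(target_model, draft_model, context, k=4):
    # 1. Draft: the small model proposes k tokens cheaply, one after another
    draft = []
    for _ in range(k):
        draft.append(draft_model.greedy_next(context + draft))

    # 2. Verify: the big model scores all k+1 positions in ONE forward pass,
    #    returning its own greedy prediction at each position
    target_preds = target_model.greedy_next_batch(context, draft)

    # 3. Accept the longest prefix where the draft matches the target,
    #    then append the target's own next token (which we get for free)
    accepted = []
    for proposed, wanted in zip(draft, target_preds):
        if proposed != wanted:
            break
        accepted.append(proposed)
    accepted.append(target_preds[len(accepted)])
    return accepted          # anywhere from 1 to k+1 new tokens per target pass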

Types of Speculative Decoding

🤖 Separate Draft Model

Use a smaller version of the same model (e.g., 7B drafting for 70B)

Speed: 2-3x faster

🔄 Self-Speculative

Use the same model with some layers skipped for drafting

Speed: 1.5-2x faster

🎯 Medusa Heads

Add multiple prediction heads to the main model

Speed: 2-3x faster

🔍 Prompt Lookup

Reuse tokens that already appeared in the prompt

Speed: Very context-dependent

Real Example

🎯 Speculative Decoding in Action

Current context: "The capital of France is"

Draft Model predicts: ["Paris", "located", "in"]
Target Model verifies:
✅ "Paris" - Correct!
❌ "located" - Rejects, generates "and"
âšī¸ "in" - Not checked (sequence broken)
Result: Accepted "Paris", generated "and"
Progress: 2 tokens in 1 forward pass! 🚀

Performance Benefits

🚀 Google AI Overviews uses speculative decoding for 2-4x speedup

⚡ Perfect for scenarios where GPU is underutilized (small batch sizes)

🎯 Best results when draft model has 60-80% acceptance rate

🦙 Ollama & GGUF: Running Models on Your Laptop

📱 The Mobile App Analogy

Think of GGUF files like mobile apps that are optimized to run on your phone instead of requiring a powerful desktop computer. They're compressed and efficient versions of the full models that can run on consumer hardware!

What is GGUF?

đŸ—œī¸ GGUF (GGML Universal File)

G: Georgi (creator's name)

G: Gerganov (creator's surname)

U: Universal

F: File format


A special file format that stores AI models in a compressed, CPU-friendly way!

Quantization Levels

📊 Q4_K_M

Size: ~40GB (70B model)

Quality: Good balance

Speed: Fast

📊 Q8_0

Size: ~75GB (70B model)

Quality: High quality

Speed: Moderate

📊 F16

Size: ~140GB (70B model)

Quality: Original quality

Speed: Slower

How Ollama Works

📥 Download

Pull GGUF model from Hugging Face or Ollama registry

🔧 Load

Load into system memory (RAM/GPU)

💬 Chat

Start chatting with OpenAI-compatible API

Running a Model Example

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model (automatically downloads GGUF)
ollama pull llama3.3:70b-instruct-q4_K_M

# Chat with the model
ollama run llama3.3:70b-instruct-q4_K_M
>>> Hello! How do you work?
I'm an AI running locally on your machine using a quantized GGUF format that
compresses my 70B parameters down to just 4 bits per parameter...

# Use via API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [
      {"role": "user", "content": "Explain inference servers"}
    ]
  }'
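
The same API call can be made from Python; this sketch assumes the openai package (v1+) is installed and that Ollama is serving on its default port 11434:

# Calling Ollama's OpenAI-compatible endpoint from Python (sketch)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain inference servers"}],
)
print(response.choices[0].message.content)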

GGUF vs Traditional Models

đŸ—ī¸ Traditional PyTorch Model

• Requires datacenter GPUs with 80GB+ VRAM (more than one for 70B in FP16)

• Full 16-bit precision weights

• Complex serving infrastructure

• Size: ~140GB for 70B model

🦙 GGUF Model with Ollama

• Runs on a workstation or high-RAM laptop (~48GB+ RAM for 70B)

• Quantized 4-bit weights

• Simple single-command setup

• Size: ~40GB for 70B model

Performance Considerations

💻 CPU Inference: 1-5 tokens/second on M2 MacBook

đŸ–Ĩī¸ GPU Inference: 10-50 tokens/second on RTX 4090

🧠 Memory Usage: Model size + 2-4GB for context

🏭 Inference Server Comparison: Choosing Your Champion

đŸŽī¸ The Racing Car Analogy

Different inference servers are like different types of racing cars. A Formula 1 car (TensorRT-LLM) is fastest on a professional track, a rally car (vLLM) works great in various conditions, and a family sedan (TGI) is reliable and easy to drive everywhere!

Server Performance Comparison

🏆 Throughput (Tokens/Second) - Llama 3 70B @ 100 Users

LMDeploy: 700 t/s
TensorRT-LLM: 700 t/s
vLLM: 650 t/s
TGI: 650 t/s

⚡ Time to First Token (Lower is Better)

vLLM: Best
TGI: Good
TensorRT-LLM: Variable
LMDeploy: Excellent

Detailed Server Breakdown

🚀 vLLM

Best for: Research, fast TTFT

Strengths: PagedAttention, easy setup

Hardware: NVIDIA, AMD, Intel

Weaknesses: Newer project, fewer enterprise features

⚡ TensorRT-LLM

Best for: Maximum NVIDIA GPU performance

Strengths: Fastest on NVIDIA, FP8 support

Hardware: NVIDIA only

Weaknesses: Complex setup, compilation needed

đŸ›Ąī¸ Triton Inference Server

Best for: Enterprise, multi-framework

Strengths: Mature, supports any model

Hardware: All major platforms

Weaknesses: Complex configuration

🤗 Text Generation Inference

Best for: HuggingFace ecosystem, beginners

Strengths: Easy setup, good docs

Hardware: Broad support

Weaknesses: Not always fastest

Decision Tree

🤔 Do you need maximum speed?

YES → TensorRT-LLM
NO → Continue...

đŸ› ī¸ Do you want easy setup?

YES → vLLM or TGI
NO → Continue...

đŸĸ Do you need enterprise features?

YES → Triton
NO → vLLM

Real-World Usage

🔥 Production at Scale: Most companies use multiple servers!

📊 Example: TensorRT-LLM for high-throughput batch inference + vLLM for interactive chat

🔄 Trend: Moving toward disaggregated architectures with specialized servers for prefill vs decode